feat: datumctl compute plugin — deploy and manage workloads from the CLI #113

Draft
scotwells wants to merge 30 commits into feat/federated-deployment-scheduling from feat/datumctl-compute-plugin

@scotwells
Contributor

Summary

Adds the datumctl compute plugin so developers can deploy and manage containerized workloads on Datum Cloud directly from the CLI.

Commands shipped:

  • deploy — push a container image as a workload with flags or a manifest file; waits for rollout
  • destroy — tear down a workload with a confirmation prompt
  • status — show workload health, per-city placement summary, and the active revision
  • instances — list all running instances across cities, with describe for full detail
  • scale — adjust minimum replica count across all placements
  • rollout — watch live rollout progress, browse revision history, and roll back to any prior revision
  • restart — trigger a rolling restart of a workload or a specific city
  • quota — inspect per-city instance usage and surface quota-exceeded messages

Revision history is stored as a ConfigMap per workload so rollout history and rollout undo work without server-side tracking.

Dependencies

What's not included

  • logs — telemetry service not yet implemented
  • Tests — next step is adding envtest-based integration tests for each command
  • cities / instance-types resource listing commands

Related

Closes #98. Design proposal in #111.

scotwells added a commit that referenced this pull request May 29, 2026
…cheduling base

After rebasing onto feat/federated-deployment-scheduling, go.mod had picked up
the wrong versions of two deps via conflict resolution:

- go.datum.net/network-services-operator was left at v0.1.0 (from #113's old
  go.mod side) instead of v0.21.10-... required by HEAD's LocationBinding usage
- go.miloapis.com/service-catalog v0.0.0-20260527221104 transitively requires
  milo v0.26.1, which has a broken downstreamclient (Apply method missing,
  ClusterName type mismatch). Add a replace directive to pin milo to v0.25.2
  (the version used by the federated-scheduling base) so downstreamclient
  compiles cleanly. service-catalog is updated to the latest available version.

Also apply gofmt alignment fixes surfaced by the rebase on instance_controller.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scotwells scotwells changed the base branch from main to feat/federated-deployment-scheduling May 29, 2026 03:30
@scotwells scotwells force-pushed the feat/datumctl-compute-plugin branch from a63c87a to c1186cb Compare May 29, 2026 03:30
scotwells and others added 23 commits May 28, 2026 22:33
Adds the datumctl-compute plugin binary with commands for deploying and
managing containerized workloads on Datum Cloud via the developer CLI.

Commands:
- deploy     — create or update a workload from flags or a manifest file
- destroy    — delete a workload and clean up its revision history
- status     — show health, placement summary, and recent revision info
- instances  — list and describe running instances across cities
- scale      — adjust minimum replica count across placements
- rollout    — watch live progress, view history, and roll back revisions
- restart    — trigger a rolling restart of a workload or specific city
- quota      — inspect per-city instance usage and quota headroom

Closes #98. Depends on datum-cloud/datumctl#198.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Within a project's virtual control plane, all resources live in the
"default" namespace — the project slug is only used to route to the
right control plane URL. Updated all commands to use
util.ResourceNamespace ("default") instead of the project name as the
k8s namespace.

Also corrects the instance type default from "d1-standard-2" to
"datumcloud/d1-standard-2" to match the format the admission webhook
requires.

Discovered while testing against the staging environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The datumctl module requirement was upgrading controller-runtime to
v0.23.3, which broke compatibility with multicluster-runtime and milo.
Eliminated the dependency by:

- Inlining the --plugin-manifest protocol in main.go
- Reading DATUM_API_HOST and DATUM_CREDENTIALS_HELPER from env directly
  in util/client.go instead of via plugin.Context()/plugin.Token()
- Reading DATUM_ORG from env in root.go instead of via plugin.NewRootCmd
- Dropping the now-unreachable internal/cmd/compute/client.go

Also updates CI workflows to use go-version-file instead of a pinned
go 1.24.0, and bumps golangci-lint to v2.12.2 which supports go 1.25.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
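
The env-based wiring described in the commit message above can be sketched as follows. This is a minimal illustration, not the code in util/client.go or root.go: the `envConfig` type and `loadEnvConfig` helper are hypothetical names, and treating `DATUM_API_HOST` and `DATUM_CREDENTIALS_HELPER` as required while `DATUM_ORG` is optional is an assumption. Only the three variable names come from the commit.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// envConfig holds the settings the plugin reads from its environment.
// The struct itself is hypothetical; only the variable names are from the commit.
type envConfig struct {
	APIHost           string // DATUM_API_HOST
	CredentialsHelper string // DATUM_CREDENTIALS_HELPER
	Org               string // DATUM_ORG (assumed optional in this sketch)
}

// loadEnvConfig reads the three variables through an injected getenv function
// so the lookup can be exercised without touching the real environment.
func loadEnvConfig(getenv func(string) string) (envConfig, error) {
	cfg := envConfig{
		APIHost:           getenv("DATUM_API_HOST"),
		CredentialsHelper: getenv("DATUM_CREDENTIALS_HELPER"),
		Org:               getenv("DATUM_ORG"),
	}
	// Assumption: a missing host or helper means the binary was not launched
	// by datumctl, so fail early with a readable message.
	if cfg.APIHost == "" {
		return envConfig{}, errors.New("DATUM_API_HOST is not set")
	}
	if cfg.CredentialsHelper == "" {
		return envConfig{}, errors.New("DATUM_CREDENTIALS_HELPER is not set")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadEnvConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Printf("api host: %s, org: %q\n", cfg.APIHost, cfg.Org)
}
```

Injecting `getenv` keeps the helper free of global state, which is one way to make this kind of lookup testable without the plugin SDK.
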
Upgrades controller-runtime from v0.21.0 to v0.23.3 and multicluster-runtime
from v0.21.0-alpha.8 to v0.23.3, which unblocks adding go.datum.net/datumctl
as a direct dependency.

The CLI plugin (datumctl-compute) now uses the official datumctl plugin SDK:
- plugin.ServeManifest() for the --plugin-manifest protocol
- plugin.NewRootCmd() for pre-wired org/project/output flags
- plugin.Context() and plugin.Token() for credential access

Controller breaking changes addressed: ClusterName distinct type, Watches
callback signature, NewWebhookManagedBy generic API. A local milo provider
fork is added at internal/provider/milo since the upstream package hasn't
been updated for the ClusterName type change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses 63 lint findings across errcheck, goconst, gocyclo, gofmt,
prealloc, staticcheck, and unparam linters:

- gofmt/goimports: reformat cmd/main.go, deploy.go, util/client.go, webhook
- errcheck: assign discarded fmt.Fprint* and Flush returns to _
- staticcheck: update webhook to generic admission.Defaulter[T]/Validator[T]
  with WithDefaulter/WithValidator; fix SA4010 unused append in quota.go;
  remove redundant .ObjectMeta selectors in restart.go
- unparam: rename four never-used function parameters to _
- gocyclo: extract helpers from watch.Rollout and quota.runQuota to reduce
  cyclomatic complexity below threshold
- goconst: extract repeated string literals to named constants across
  controllers, validation, and tests
- prealloc: preallocate slices with known capacity in validation and tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- errcheck: fix unchecked fmt.Fprint* returns in deploy, quota, rollout, scale
- prealloc: preallocate allErrs in workload_validation.go and stateful test
- gofmt: reformat destroy.go, instances.go, rollout.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- golangci.yml: exclude errcheck for internal/cmd/* — ignoring write
  errors on stdout/stderr is idiomatic in CLI tools
- prealloc: preallocate allErrs in validateScaleSettingMetrics
- gofmt: reformat status.go, instance_controller_test.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire ValidArgsFunction on every command that accepts a workload name
(deploy, destroy, restart, rollout, rollout history, rollout undo,
scale, status) and register flag completion for instances --workload.

All completions call a shared CompleteWorkloadNames helper in
internal/cmd/compute/util that fetches live workload names from the
API and always returns ShellCompDirectiveNoFileComp so the shell
never falls back to filename completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove ValidArgsFunction from deploy and replace with
  util.CompleteWorkloadNamesAndFlags, which wraps CompleteWorkloadNames
  with plugin.WithFlagCompletion from the datumctl SDK.
- Add plugin.WithFlagCompletion to the datumctl plugin SDK so any plugin
  can get the same behaviour by wrapping their own ValidArgsFunction.
- Bump go.datum.net/datumctl to b44de1c (adds WithFlagCompletion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the hardcoded datum-control-plane ClusterIssuer from the
csi-webhook-cert component. DNS names stay since they are fixed by the
service name and namespace. Each consuming overlay now supplies the issuer
via a strategic merge patch, allowing different environments to use
different cert issuers without forking the component.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cert issuer name is environment-specific configuration that belongs
in the infra repo, not the compute overlay. The infra repo's base manager
patch already owns the full webhook-server-tls volume definition including
the issuer. Consumers deploying outside infra must patch the issuer in their
own overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a printer.go with PrintJSON and PrintYAML helpers that commands can
use to emit API resources as structured output. Extend completion.go with
CompleteInstanceNames, CompleteCityCodes, and CompleteOutputFormats so all
-o/--output, --city, and instance-name completions are driven from a
single shared source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both commands now accept -o/--output with tab-completion. json and yaml emit
the underlying API resource (InstanceList) for instances and structured quota
rows for quota. wide adds an INSTANCE TYPE column for instances. --no-headers
suppresses the header row for table and wide. City completion is wired to
CompleteCityCodes and instance describe gains tab-completion via
CompleteInstanceNames.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add datumctl compute workloads (list) and workloads describe <name>
commands. The list command shows NAME/HEALTH/READY/PLACEMENTS/IMAGE/AGE
columns with --health and --city filters, -o table|wide|json|yaml, and a
footer summary. The describe command replaces status with a unified
config+health view: header block, per-placement per-city ready counts with
inline degradation annotations, and a container spec block. Remove the
now-redundant status command from root.go and delete its package.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix duplicate TYPE/INSTANCE TYPE columns in instances -o wide (W3):
  populate TYPE from runtimeKind (sandbox/vm), INSTANCE TYPE from instType
- Fix footer bucketing in instances list (W4): compute Running/Pending/Failed
  from actual status strings instead of hardcoding Failed=0
- Skip revision ConfigMap Gets in workloads list table mode (W5): only
  fetch per-workload revision when -o wide is requested, avoiding N
  round-trips on every list invocation
- Compute health footer tallies after filters are applied (W9): previously
  counted all workloads then printed a filtered subset, making the summary
  misleading when --health or --city filters were active
- Fix gofmt import ordering in workloads.go (B1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before creating a workload, the deploy command now checks whether the
required network(s) exist. If a network is missing, the user is offered
the option to create a minimal auto-IPAM network in-place rather than
hitting an opaque NetworkNotFound error post-submission.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… API

- Add EnsureComputeEntitlement to gate all compute commands on an active
  service entitlement; prompts TTY users to request access and surfaces
  approval status
- Rewrite quota command to query AllowanceBucket resources from the
  project VCP (milo-system namespace) instead of deriving usage from
  instance quota conditions
- Add NewPlatformClient targeting the platform API server for
  ResourceRegistration lookups
- Extract ListServiceQuota into util so other service plugins can reuse
  the quota display logic with their own resource type prefix and
  display metadata overrides

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hand-rolled HTTP entitlement code with a proper client-go
implementation using go.miloapis.com/service-catalog types. Uses
client.WithWatch to stream events from the API server and unblocks
as soon as the Ready condition appears — no polling interval.

Also adds ASCII progress bar to quota table output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
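
For readers curious what an ASCII progress bar in a quota table involves, here is a minimal sketch. The `progressBar` name, the `#`/`-` glyphs, and the clamping rules are all assumptions; the commit does not describe the actual rendering.

```go
package main

import (
	"fmt"
	"strings"
)

// progressBar renders used/limit as a fixed-width bar such as "[###-------]".
// Usage above the limit is clamped to a full bar, and a non-positive limit
// renders as empty, so the function never panics on odd inputs.
func progressBar(used, limit int64, width int) string {
	if width < 0 {
		width = 0
	}
	filled := 0
	if limit > 0 && used > 0 {
		filled = int(used * int64(width) / limit)
	}
	if filled > width {
		filled = width
	}
	return "[" + strings.Repeat("#", filled) + strings.Repeat("-", width-filled) + "]"
}

func main() {
	fmt.Println(progressBar(3, 10, 10)) // prints "[###-------]"
}
```

Integer division rounds the filled portion down, so a bar only appears full once usage actually reaches the limit.
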
The compute CLI client now serializes network-services-operator types
(Network, NetworkBinding, SubnetClaim), so deploy can preflight and
create networks on the user's behalf.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Deployment revisions are becoming a platform concept rather than a
client concern. Remove the ConfigMap-backed revision ledger the CLI
maintained per workload, along with the 'rollout history' and 'rollout
undo' subcommands and the revision column in 'workloads'. 'rollout'
remains as a live-progress watch.

This also removes the only code path that serialized core/v1 ConfigMaps
from the CLI, so the missing-corev1-scheme warning on deploy no longer
occurs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… resolution

The first conflict resolution in the aa9dc15 commit accidentally truncated
workload_webhook.go, dropping the ValidateCreate method, its kubebuilder
marker, and producing a syntactically invalid Default function body
(extra brace + wrong return signature). Restore the file to match
5486adf's content (the authoritative post-lint-migration version).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The platform now stamps city-code, workload-name, workload-deployment-name,
and placement-name directly onto Instances at creation time. The CLI can
therefore resolve CITY/WORKLOAD/placement directly from those labels without
performing cross-object joins.

The prior approach keyed the WorkloadDeployment map on UID and looked up
instances via WorkloadDeploymentUIDLabel. That UID is the edge/Karmada WD UID,
which differs from the project-cluster WD UID, causing the join to fail across
federation planes and producing "unknown"/"orphaned" output.

The new label-first path reads CityCodeLabel, WorkloadNameLabel,
PlacementNameLabel, and WorkloadDeploymentNameLabel (name is identical across
all planes) before falling back to the WD Get/List join. A wdNameFromInstanceName
helper strips the trailing ordinal suffix from the Instance name as a last-resort
fallback for instances created before the labels existed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
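
The label-first resolution with a name-based fallback, as described in the commit message above, can be sketched like this. The body of `wdNameFromInstanceName` is a guess at "strip the trailing ordinal suffix", assuming instance names look like `<deployment-name>-<ordinal>`. The label key and the `resolveWDName` helper are hypothetical, and the intermediate WD Get/List join is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// wdNameLabel is a hypothetical label key; the real key lives in the API types.
const wdNameLabel = "example.test/workload-deployment-name"

// wdNameFromInstanceName strips a trailing "-<digits>" ordinal, if present.
// Names without a purely numeric last segment are returned unchanged.
func wdNameFromInstanceName(name string) string {
	i := strings.LastIndex(name, "-")
	if i <= 0 || i == len(name)-1 {
		return name
	}
	for _, r := range name[i+1:] {
		if r < '0' || r > '9' {
			return name
		}
	}
	return name[:i]
}

// resolveWDName prefers the label stamped by the platform and only falls back
// to parsing the instance name for instances created before labels existed.
func resolveWDName(labels map[string]string, instanceName string) string {
	if v := labels[wdNameLabel]; v != "" {
		return v
	}
	return wdNameFromInstanceName(instanceName)
}

func main() {
	labeled := map[string]string{wdNameLabel: "web-dfw"}
	fmt.Println(resolveWDName(labeled, "anything-7")) // prints "web-dfw"
	fmt.Println(resolveWDName(nil, "web-lhr-3"))      // prints "web-lhr"
}
```

Because the deployment name is the same on every federation plane, either path yields a key that joins correctly, unlike the UID.
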
scotwells and others added 3 commits June 1, 2026 15:43
The `compute deploy` rollout watcher reported PHASE=Done and exited
within seconds of creating the workload, before any instances were
scheduled. A WorkloadDeployment's Status.DesiredReplicas stays at zero
until the controller first reconciles it, and computePhase treated zero
desired as Done — so the very first poll of a fresh deployment looked
complete.

Resolve the wait target from the spec minimum while the controller has
not yet reported a desired count, and require that no stale replicas
remain before reporting Done so scale-downs and rolling updates aren't
declared complete while old instances are still draining.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
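
The phase logic described in the commit message above can be illustrated with a small sketch. The flattened `deployment` struct, its field names other than the idea of Status.DesiredReplicas, the `waitTarget` helper, and the "Progressing" phase name are assumptions; only `computePhase` and the "Done" phase are named in the commit.

```go
package main

import "fmt"

// deployment is a hypothetical, flattened view of a WorkloadDeployment.
type deployment struct {
	SpecMinReplicas int32 // minimum replicas requested in the spec
	StatusDesired   int32 // Status.DesiredReplicas; zero until first reconcile
	Ready           int32 // replicas that are ready on the current revision
	Stale           int32 // old replicas that are still draining
}

// waitTarget falls back to the spec minimum while the controller has not yet
// reported a desired count, so a fresh deployment is not seen as complete.
func waitTarget(d deployment) int32 {
	if d.StatusDesired > 0 {
		return d.StatusDesired
	}
	return d.SpecMinReplicas
}

// computePhase reports Done only when the target is met and nothing is draining.
func computePhase(d deployment) string {
	if d.Ready >= waitTarget(d) && d.Stale == 0 {
		return "Done"
	}
	return "Progressing"
}

func main() {
	fresh := deployment{SpecMinReplicas: 2}
	fmt.Println(computePhase(fresh)) // prints "Progressing"

	settled := deployment{SpecMinReplicas: 2, StatusDesired: 2, Ready: 2}
	fmt.Println(computePhase(settled)) // prints "Done"
}
```

The stale check is what keeps scale-downs and rolling updates from being reported complete while old instances are still draining.
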
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scotwells and others added 3 commits June 1, 2026 20:10
Consume the server-side status-blocking-reason contract: each resource's
readiness condition (Instance/Ready, WorkloadDeployment/Available,
Workload/Available) now carries a machine-readable reason and human message
when not True.

- Add ReadinessBlock helper in util/conditions.go: given a condition list and
  type, returns (reason, message, blocked) with no per-reason branching —
  the single reusable entry-point for the new contract.
- InstanceStatus (list view): falls through to "Pending (<reason>)" from the
  Ready condition when no specific sub-condition check matches, replacing the
  bare "Pending" for unknown causes like SourceNotFound or ReferencedDataNotReady.
- InstanceStatusDetail (describe view): falls through to "Pending — <reason>"
  with the message as detail, replacing "Unknown" for those same causes.
- WorkloadHealth: surfaces the reason from Available when false, e.g.
  "Unavailable — SourceNotFound" instead of the generic message.
- degradedAnnotation (workloads describe per-city line): rewritten to read the
  WorkloadDeployment's own Available condition; removes the per-instance List
  fetch and the quota/InstanceStatusDetail special-casing that was its only logic.
- printBlockedDetail (rollout watch): rewritten to read the deployment's
  Available condition; removes the per-instance List fetch entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
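
A helper with the shape the commit message above gives for `ReadinessBlock` (condition list and type in; reason, message, and blocked out) could be sketched as below. The local `Condition` struct stands in for the Kubernetes condition type so the example stays self-contained, and treating a missing condition as "not blocked" is an assumption rather than something the commit states.

```go
package main

import "fmt"

// Condition is a simplified, local stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// ReadinessBlock looks up condType and reports why the resource is not ready.
// Any status other than "True" counts as blocked, which covers both False and
// Unknown. There is deliberately no per-reason branching here.
func ReadinessBlock(conds []Condition, condType string) (reason, message string, blocked bool) {
	for _, c := range conds {
		if c.Type != condType {
			continue
		}
		if c.Status == "True" {
			return "", "", false
		}
		return c.Reason, c.Message, true
	}
	// Assumption: an absent condition is treated as "nothing to report".
	return "", "", false
}

func main() {
	conds := []Condition{{
		Type:    "Ready",
		Status:  "False",
		Reason:  "SourceNotFound",
		Message: "referenced source does not exist",
	}}
	if reason, msg, blocked := ReadinessBlock(conds, "Ready"); blocked {
		fmt.Printf("Pending (%s): %s\n", reason, msg)
	}
}
```

Callers format the returned reason however their view requires, which is what lets the list, describe, and rollout views share one entry point.
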
…rovisioning status

The Programmed condition starts as Unknown (not False) while programming
is in progress, so the ConditionFalse-only checks were bypassed and the
raw ProgrammingInProgress reason leaked through the Ready condition
fallback. Widen the checks to status != True to cover both Unknown and
False states.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add three provider-emitted reason constants to the API types and map
them to plain-English STATUS strings in the list and describe views:

  ImageUnavailable  → Failed (image unavailable)
  InstanceCrashing  → Failed (crashing)
  ConfigurationError → Failed (configuration error)

Rename the PendingProgramming/ProgrammingInProgress cases from the
misleading "network provisioning" to "Starting", which accurately
describes the transient state without implying network work is involved.

Failed statuses are already counted in the "N Failed" summary line via
the existing strings.HasPrefix(status, "Failed") check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
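
The reason-to-STATUS mapping in the commit message above, together with the "Pending (<reason>)" fallback from the earlier list-view change, can be sketched as a single switch. The `statusForReason` and `countFailed` helpers are hypothetical names; the reason strings and display strings are taken from the commit messages, and the real code uses named constants from the API types.

```go
package main

import (
	"fmt"
	"strings"
)

// statusForReason maps a machine-readable reason to a list-view STATUS string.
func statusForReason(reason string) string {
	switch reason {
	case "ImageUnavailable":
		return "Failed (image unavailable)"
	case "InstanceCrashing":
		return "Failed (crashing)"
	case "ConfigurationError":
		return "Failed (configuration error)"
	case "PendingProgramming", "ProgrammingInProgress":
		return "Starting"
	case "":
		return "Pending"
	default:
		// Unknown reasons are surfaced verbatim instead of a bare "Pending".
		return "Pending (" + reason + ")"
	}
}

// countFailed mirrors the prefix check that feeds the "N Failed" summary line.
func countFailed(statuses []string) int {
	n := 0
	for _, s := range statuses {
		if strings.HasPrefix(s, "Failed") {
			n++
		}
	}
	return n
}

func main() {
	reasons := []string{"ImageUnavailable", "ProgrammingInProgress", "SourceNotFound"}
	statuses := make([]string, 0, len(reasons))
	for _, r := range reasons {
		statuses = append(statuses, statusForReason(r))
	}
	fmt.Println(statuses, countFailed(statuses))
}
```

Because every failure string starts with "Failed", new failure reasons are picked up by the summary line without touching the counting code.
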
@scotwells
Contributor Author

📋 Real-world UX issue from a user enabling compute

Heads up — we got a user report that surfaces a confusing first-run experience with the enablement flow, and I've traced it end-to-end via the staging audit logs. Sharing here since the fix touches this plugin.

What the user saw:

% datumctl compute instances list
Compute is not enabled for project "personal-project-153fe986".
Would you like to request access? [y/N]: y
Requesting access to compute for project "personal-project-153fe986"...
Error: requesting compute access: serviceentitlements.services.miloapis.com "compute" already exists

From their perspective this looks like a flat-out failure. In reality, their first attempt succeeded — compute was enabled.

What actually happened (from the audit trail):

  1. First run created the entitlement successfully. ✅
  2. But the backend took a while (on the order of minutes in this case) to mark it Ready.
  3. During that window, the CLI's "is compute enabled?" check keys off the entitlement's Ready status, not its existence — so it kept reporting "not enabled" and re-offering to request access.
  4. Each retry tried to create the entitlement again and hit a 409 already exists, which we surfaced as a raw, scary error.

Why it matters for the product: the very first thing a new user does is turn compute on, and today that happy path can look broken even when it worked. The error message also leaks an internal resource name (serviceentitlements.services.miloapis.com) that means nothing to a user.

Proposed fix (branch fix/compute-entitlement-pending-state, built off this PR's branch): teach the enablement check to distinguish three states instead of two —

  • not requested → offer to request access (today's behavior)
  • requested but still activating → tell the user it's in progress and to try again in a moment (no re-prompt, no error)
  • active → proceed

…and treat a 409 already exists as "already requested, activation pending" rather than a fatal error. Net result: the user sees a calm "enablement in progress, hang tight" message instead of a stack of confusing failures.

Happy to fold this into this PR or send it as a follow-up — whichever you prefer. 🙏

Development

Successfully merging this pull request may close these issues.

Define the UX, DX, and AX for deploying and managing compute workloads